Abstract

In a distributed network environment, the diffusion least mean squares (diffusion-LMS) algorithm converges faster than the
original LMS algorithm. It has also been observed that diffusion-LMS generally outperforms other distributed LMS algorithms
such as spatial LMS and incremental LMS. However, neither the original LMS nor diffusion-LMS is applicable in non-linear
environments where data may not be linearly separable. A variant of LMS called kernel-LMS (KLMS) has been proposed in
the literature for such non-linearities. In this paper, we propose a kernelised version of diffusion-LMS for non-linear distributed
environments. Simulations show that the proposed approach has superior convergence compared to algorithms of the same
genre. We also introduce a technique to predict the transient and steady-state behaviour of the proposed algorithm. The techniques
proposed in this work (or algorithms of the same genre) can be easily extended to distributed parameter estimation applications such as
cooperative spectrum sensing and massive multiple input multiple output (MIMO) receiver design, which are potential components
of 5G communication systems.


Index Terms 

KLMS, Algorithm, Diffusion-LMS, Distributed Adaptive Filtering, Massive MIMO, Cognitive Radio 


I. Introduction 

Nowadays, there is a thrust toward development of a new standard for communications called 5G, which involves
novel approaches such as massive multiple input multiple output (MIMO), cooperative spectral sensing, visible light communication
(VLC), etc. [1]. Massive MIMO uses a large number of antenna array elements (which consist of antennae at the receiver and
those at the network nodes), which greatly increases the capacity of the communication system. Spectral sensing is a technique
to estimate vacant spectral subbands adaptively. Such vacant subbands may be used to accommodate incoming transmissions,
which saves bandwidth because no new frequency band has to be allocated for the incoming signal. Distributed diffusion-based
adaptive filtering algorithms have potential applications in cooperative spectral sensing and distributed MIMO detection
[1], [?], [?]. Hence adaptive filtering/optimisation over distributed networks is an important and emerging research
area which can be applied to 5G standard components.

Distributed signal processing deals with drawing inferences from data coming from various nodes in a given graph. Robust 
distributed algorithms are required to draw inferences from the intelligently fused data from all the nodes. The task of training 
an artificial computer to automatically draw inferences and take decisions is assigned to the statistical learning techniques. 

This paper is a preprint of a paper submitted to IET Signal Processing (special issue on 5G wireless networks) and is subject to Institution of Engineering
and Technology Copyright. If accepted, the copy of record will be available at IET Digital Library.
Statistical learning algorithms may be categorised into four distinct classes: a) supervised learning, b) unsupervised learning,
c) semi-supervised learning and d) reinforcement learning [?]. In supervised learning, the data labels are assumed to be known
during training. In unsupervised learning, the data labels are not known during training. In semi-supervised learning, only a
subset of the labels is known. In reinforcement learning, the algorithm is trained so as to maximise a utility
function. The scope of this paper is limited to distributed supervised learning.

One of the well-known supervised learning rules is the Widrow-Hoff learning rule, or the least mean squares (LMS) algorithm.
It belongs to the class of stochastic gradient algorithms. It replaces the expectation operator in the Wiener-Hopf equation [?]
by the instantaneous gradient of the quadratic cost function. In the recent literature [?], [8], there has been a major thrust
towards generalising the LMS algorithm to distributed environments. A variant of the widely known LMS algorithm,
called diffusion-LMS, has been used for distributed optimisation in [7], with a wide range of
application areas. This algorithm uses stochastic matrices to intelligently fuse the data coming from different sources (for
example, nodes of the network) and has the best performance among all distributed counterparts of the LMS algorithm [9], [10].
Similarly, diffusion extensions of other adaptive filtering algorithms, such as recursive least squares (diffusion-RLS), have also
been proposed [11].

Classical adaptive filtering algorithms like LMS and diffusion-LMS (for networks) work well for affinely separable data.
However, when the data is not guaranteed to be affinely separable [?], which occurs frequently in non-linear
scenarios, the kernel least mean squares (KLMS) algorithm has been found in the literature to perform better, as demonstrated
in [?], and has found wide applicability, as in [13], [14], [15]. The basic principle of KLMS is the kernel trick [?], which maps
the input data into a high-dimensional reproducing kernel Hilbert space (RKHS) in which it is linearly separable [?]. Similar extensions of
linear algorithms, such as the affine projection algorithm, to kernel spaces exist, as in [?]. Kernel-based distributed learning algorithms
have been proposed in the literature [?], [?], [?]. However, they neither address the kernel LMS regression problem in the
diffusion framework, nor is their performance analysed in terms of popular performance metrics.

In this paper, we propose an extension of KLMS for distributed networks. In other words, we seek to apply the kernel
trick to the diffusion-LMS adaptations given in [10]. We also seek to provide theoretical expressions that govern the proposed
algorithm’s transient and steady-state behaviour, as has been done in [10], [20], using the classical adaptive filtering theory based
approaches given in [21].

This paper is organised as follows: to facilitate understanding of the background material and concepts forming the theoretical
basis of the proposed algorithm, the diffusion-LMS algorithm and the KLMS algorithm are reviewed in Section-II and Section-III
respectively. The diffusion-KLMS algorithm is proposed in Section-IV. To gain insight into the performance of the algorithm,
the transient performance, steady-state performance and condition for convergence are mathematically analysed in Section-V. The
simulation results and a comparison with other algorithms are provided in Section-VI, and Section-VII concludes the paper.

II. Review of Distributed Diffusion-LMS

In this section, we review the distributed diffusion-LMS as given in [10]. In the distributed diffusion-LMS algorithm,
there is a set of nodes in a graph $\mathcal{G}$. The neighbourhood of a node in the graph is given by a set of nodes $\mathcal{G}'$ such that there
exists an edge between that node and the nodes in the set $\mathcal{G}'$. Please note that, for each node, $\mathcal{G}'$ also includes the node itself.
Let the stochastic matrices be given by $A = [a_{ij}]$ and $C = [c_{ij}]$, whose entries represent a probabilistic weight from node $i$ to node
$j$. These matrices are generally determined by stochastic sampling techniques as given in [?].

For a distributed adaptive graph indexed by the time variable $n$, the adaptive filter at a given node $q$ attempts to minimise the local cost
function $J_q(n)$ at time instant $n$:

$$J_q(n) = \sum_{l \in \mathcal{G}'} c_{lq} J_l(n) \qquad (1)$$

where $l$ runs over all members of the neighbourhood of the $q$-th node of the network and forms the local cost function $J_q$.

For this, the distributed Wiener-solution-based local estimate at node $q$, $w_q^o$, is given as:

$$w_q^o(n) = \Big( \sum_{l \in \mathcal{G}'} c_{lq} R_{x_l,n} \Big)^{-1} \Big( \sum_{l \in \mathcal{G}'} c_{lq} r_{d x_l,n} \Big) \qquad (2)$$

where $R_{x_l,n}$ is the autocorrelation matrix for the $l$-th node in the neighbourhood of the node $q$ of the graph, $r_{d x_l,n}$ is the cross-correlation
between the desired output $d$ and $x_l$, and $x_l$ is the data from the $l$-th member of the neighbourhood of node $q$ at time $n$.

The weight vector $w_q$, for the $q$-th node, is iteratively adapted by diffusion-LMS as follows:

$$\psi_q(n+1) = \psi_q(n) + \mu \sum_{l \in \mathcal{G}'} c_{lq} \big( d_l(n) - w_l(n)^T x_l \big) x_l \qquad (3)$$

$$w_q(n+1) = \sum_{l \in \mathcal{G}'} a_{lq} \psi_l(n+1) \qquad (4)$$

where $\mu$ is the step-size, $d_l(n)$ is the desired response at the $l$-th node at the $n$-th time instant and $\psi_q(n)$ is the vector of
intermediate values of the adaptive filter at the $q$-th node at the $n$-th time instant, before it is combined probabilistically over its
neighbourhood to get the final updated estimate.

The steps in eq. (3) and (4) can be carried out in either order. In both situations, the result belongs to the same genre of
algorithms. If eq. (3) is carried out first, it is called Adapt and Then Combine (ATC) diffusion. If eq. (4) is carried out
first, it is called Combine and Then Adapt (CTA) diffusion [10].
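The ATC ordering can be sketched in a few lines of Python. This is a minimal illustration rather than the reference implementation of [10]: it assumes the adapt step starts from each node's current combined estimate, uses the local errors of eq. (3), and runs on a hypothetical two-node network with a noise-free linear model. The names `atc_diffusion_lms`, `w_true` and the data layout are introduced here for illustration only.

```python
import random


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def atc_diffusion_lms(X, d, A, C, mu):
    """Adapt-then-combine diffusion-LMS (illustrative sketch).

    X[n][l] is the regressor at node l and time n, d[n][l] the desired response.
    A[l][q] and C[l][q] weight the information flowing from node l to node q.
    """
    num_nodes, dim = len(A), len(X[0][0])
    w = [[0.0] * dim for _ in range(num_nodes)]
    for x_n, d_n in zip(X, d):
        # Local errors d_l(n) - w_l(n)^T x_l used in the adapt step (3).
        err = [d_n[l] - dot(w[l], x_n[l]) for l in range(num_nodes)]
        # Adapt: each node moves along the C-weighted neighbourhood gradient.
        psi = [
            [w[q][k] + mu * sum(C[l][q] * err[l] * x_n[l][k] for l in range(num_nodes))
             for k in range(dim)]
            for q in range(num_nodes)
        ]
        # Combine (4): A-weighted average of the intermediate estimates.
        w = [
            [sum(A[l][q] * psi[l][k] for l in range(num_nodes)) for k in range(dim)]
            for q in range(num_nodes)
        ]
    return w


# Hypothetical two-node example with a noise-free linear model.
rng = random.Random(0)
w_true = [0.5, -1.0]
A = C = [[0.5, 0.5], [0.5, 0.5]]
X = [[[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2)] for _ in range(2000)]
d = [[dot(w_true, x) for x in x_n] for x_n in X]
w_hat = atc_diffusion_lms(X, d, A, C, mu=0.1)
```

Swapping the two inner stages (combine the previous estimates first, then adapt) would give the CTA variant.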

Please note that an important factor in the convergence of adaptive filters is the spectral radius of the covariance matrix.
This spectral radius is a norm in itself. Applying Jensen’s inequality to the spectral radius, $\rho_{max}$, of the weighted
covariance matrix as in [10],

$$\rho_{max}\Big( \sum_{l \in \mathcal{G}'} c_{lq} R_{lq} \Big) \leq \sum_{l \in \mathcal{G}'} c_{lq}\, \rho_{max}(R_{lq}) \qquad (5)$$

where $N = |\mathcal{G}|$ and $R_{lq}$ is the autocorrelation matrix of the $l$-th neighbour of the $q$-th node. Hence, due to the lower eigenvalue
spread, it converges faster. More rigorous convergence results are found in [10].
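The inequality can be checked numerically on a small case. The sketch below assumes symmetric 2x2 autocorrelation matrices (so the eigenvalues have a closed form) and uses made-up matrix values; `spectral_radius_sym2` and `weighted_sum` are helper names introduced here.

```python
import math


def spectral_radius_sym2(m):
    """Largest absolute eigenvalue of a symmetric 2x2 matrix [[a, b], [b, c]]."""
    (a, b), (_, c) = m
    mean, dev = (a + c) / 2.0, math.hypot((a - c) / 2.0, b)
    return max(abs(mean + dev), abs(mean - dev))


def weighted_sum(weights, mats):
    """Entry-wise weighted combination of 2x2 matrices."""
    return [[sum(w * m[i][j] for w, m in zip(weights, mats)) for j in range(2)]
            for i in range(2)]


# Illustrative neighbour autocorrelation matrices and combination weights.
R = [[[2.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 3.0]]]
c = [0.5, 0.5]
lhs = spectral_radius_sym2(weighted_sum(c, R))                  # radius of the fused matrix
rhs = sum(w * spectral_radius_sym2(m) for w, m in zip(c, R))    # fused radii
```

For these values the fused matrix has spectral radius 2.0, below the weighted average 2.5 of the individual radii.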

III. Review of KLMS

The linear LMS as described in [?], [?] minimises the following cost function at the $n$-th instant:

$$J_{LMS}(n) = E[(d(n) - w(n)^T x_n)^2] \qquad (6)$$

where $x_n$ is the observation vector for the $n$-th time instant and $E[\cdot]$ is the expectation operator. Dropping the expectation
operator and taking the gradient with respect to $w$, we arrive at the following stochastic gradient update rule [5]:

$$w(n+1) = w(n) + \mu\, e_{LMS}(n)\, x_n \qquad (7)$$

where $e_{LMS}(n) = d(n) - w(n)^T x_n$.

When the data is not linearly separable, the above adaptation does not converge to the optimum value. Hence, in such scenarios,
we invoke the kernel trick and map the vectors to an RKHS $\mathcal{H}$ as in [?] by a feature map $\phi : \mathbb{R}^m \to \mathcal{H}$.

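In RKHS, the adaptation can be written as follows:

$$\Omega(n) = \Omega(n-1) + \mu\, e_{KLMS}(n-1)\, \phi(x_{n-1}) \qquad (8)$$

where $\Omega$ is the implicit parameter to be estimated in RKHS. This can be written as a running summation as follows:

$$\Omega(n) = \mu \sum_{i=0}^{n-1} e_{KLMS}(i)\, \phi(x_i) \qquad (9)$$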

Taking the inner product with the latest observation and assuming zero initial conditions gives the following recursion,
as in [?]:

$$y(n+1) = \mu \sum_{i=0}^{n-1} e_{KLMS}(i)\, \langle \phi(x_i), \phi(x_n) \rangle_{\mathcal{H}} \qquad (10)$$

where

$$e_{KLMS}(n) = d(n) - y(n) \qquad (11)$$

is the error at the $n$-th instant and $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes a real kernel inner product [?] on the RKHS $\mathcal{H}$. Several choices of kernel
inner product exist, among them the polynomial and Gaussian kernels [?]. This algorithm has a nice self-regularising
property, and has been studied in detail in [12].
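The running-sum form of KLMS translates almost directly into code: every past sample is stored as a centre with coefficient equal to the step-size times the error at that sample. The sketch below is a minimal illustration under assumptions that are not taken from the paper: scalar inputs, an unnormalised Gaussian kernel, and a hypothetical noise-free regression task. The function names are introduced here.

```python
import math
import random


def gaussian_kernel(x, y, sigma):
    """Unnormalised Gaussian kernel exp(-|x - y|^2 / (2 sigma^2)) for scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def klms(inputs, desired, mu, sigma):
    """Run KLMS once over the data; returns centres, coefficients and errors."""
    centres, coeffs, errors = [], [], []
    for x, d in zip(inputs, desired):
        # Output as a kernel expansion over all past samples.
        y = sum(a * gaussian_kernel(c, x, sigma) for a, c in zip(coeffs, centres))
        e = d - y                      # error before the update
        centres.append(x)              # every sample becomes a new centre
        coeffs.append(mu * e)          # with coefficient mu * e
        errors.append(e)
    return centres, coeffs, errors


# Hypothetical noise-free regression task: learn x -> x - 0.9 x^2 on [-1, 1].
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(500)]
ds = [x - 0.9 * x ** 2 for x in xs]
centres, coeffs, errors = klms(xs, ds, mu=0.2, sigma=0.5)
head_mse = sum(e * e for e in errors[:50]) / 50
tail_mse = sum(e * e for e in errors[-50:]) / 50
```

Note that the expansion grows by one term per sample, so the cost of each output evaluation grows linearly with the number of samples seen.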

IV. Proposed Diffusion-KLMS 

Based on the KLMS algorithm reviewed in the previous section, we propose its distributed variant in this section, based
on the diffusion approach. We now define the matrices and symbols that will be used in this paper. The
matrix $Y = [y(l,n)]$ denotes the output corresponding to the $l$-th neighbour at the $n$-th time instant. $E = [e(l,n)]$ is the error matrix
corresponding to the $l$-th neighbour at the $n$-th time instant. $X = [\{x_l(n)\}]$ is a matrix of measurement vectors from the neighbours of
node $q$ at time instant $n$, stacked together. In the following, we denote the collection of the data from the various
nodes at the $n$-th time instant as $X(n)$. $X(n)$ contains the data pertaining to all $l$ neighbours stacked in row-vector form. If
there is no vector from a node in the neighbourhood, it is replaced by the zero vector in $X(n)$ and has a corresponding
0 entry in $C$.
The gradient from eq. (3) is redefined as:

$$\nabla_{\psi} J_q(n) = e(l,n)\, \phi(C X(n)) \qquad (12)$$

where $\phi(\cdot)$ is a feature map from $\mathbb{R}^d \to \mathcal{H}$, where $d$ is the dimensionality of the data and $\mathcal{H}$ is an RKHS. Applying the kernel
trick results in

$$y(l, n+1) = \mu \sum_{i=0}^{n-1} e(l,i)\, \langle C X(i), X(n) \rangle_{\mathcal{H}} \qquad (13)$$

$$e'(q, n+1) = \sum_{l \in \mathcal{G}'} a(q,l)\, d_l(n) - \sum_{l \in \mathcal{G}'} a(q,l)\, y(l,n) \qquad (14)$$

where $A$ is a stochastic matrix corresponding to the probabilistic weights $\{a(q,l)\}$. The error at the $n$-th time instant at the $q$-th
node would be the (transformed) mean (by $A$) of $e$ over all possible $l$.

The proposed algorithm is given below; the following three steps are iterated until convergence:

1) Estimate the outputs of node $l$ using the estimates of the error $e_l(n)$.

2) Form an estimate of the errors at time instant $n$ at each node $l$. Let this be given by the vector $e(n)$ whose $l$-th element is
$e(l,n)$. Then the error term for the $l$-th node at the $n$-th time instant can be written as $e(l,n) = d(n) - y(l,n)$.

3) The error at each node is modified by the transformation $A$ through the equation $e'(n+1) = A e(n)$, where $e(n)$ and $e'(n)$
are vectors of error terms corresponding to all the nodes (indexed by $l$) stacked together.
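The three steps above can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: it assumes scalar inputs, an unnormalised Gaussian kernel, centres formed by the C-weighted fusion of the neighbourhood inputs, and a hypothetical setup in which both nodes observe the same input with independent noise on the desired response. All function and variable names are introduced here for illustration.

```python
import math
import random


def gaussian_kernel(x, y, sigma):
    """Unnormalised Gaussian kernel for scalar inputs (an assumed choice)."""
    return math.exp(-((x - y) ** 2) / (2.0 * sigma ** 2))


def predict(model, x, sigma):
    """Kernel expansion output of one node; model is a (centres, coeffs) pair."""
    centres, coeffs = model
    return sum(a * gaussian_kernel(c, x, sigma) for a, c in zip(coeffs, centres))


def diffusion_klms(X, D, A, C, mu, sigma):
    """X[n][l], D[n][l]: scalar input and desired response at node l, time n."""
    num_nodes = len(A)
    models = [([], []) for _ in range(num_nodes)]
    for x_n, d_n in zip(X, D):
        # Step 1: node outputs from the current kernel expansions.
        y = [predict(models[l], x_n[l], sigma) for l in range(num_nodes)]
        # Step 2: local errors e(l, n) = d_l(n) - y(l, n).
        e = [d_n[l] - y[l] for l in range(num_nodes)]
        # Step 3: diffuse the errors over the network, e' = A e.
        e_diff = [sum(A[q][l] * e[l] for l in range(num_nodes)) for q in range(num_nodes)]
        for q in range(num_nodes):
            # New centre: C-weighted fusion of the neighbourhood inputs.
            fused = sum(C[l][q] * x_n[l] for l in range(num_nodes))
            models[q][0].append(fused)
            models[q][1].append(mu * e_diff[q])
    return models


def f(x):
    """Hypothetical non-linear target used only for this illustration."""
    return x - 0.9 * x ** 2


# Hypothetical setup: both nodes see the same input, with independent noise.
rng = random.Random(2)
A = C = [[0.5, 0.5], [0.5, 0.5]]
xs = [rng.uniform(-1.0, 1.0) for _ in range(500)]
X = [[x, x] for x in xs]
D = [[f(x) + rng.gauss(0.0, 0.4) for _ in range(2)] for x in xs]
models = diffusion_klms(X, D, A, C, mu=0.2, sigma=0.5)

# Compare the trained node-0 predictor with the all-zero predictor on a grid.
grid = [k / 10.0 for k in range(-9, 10)]
test_mse = sum((f(x) - predict(models[0], x, 0.5)) ** 2 for x in grid) / len(grid)
baseline_mse = sum(f(x) ** 2 for x in grid) / len(grid)
```

Because the diffused error averages independent noise across nodes, the coefficients stored at each node are less noisy than in single-node KLMS, which is the intuition behind the lower MSE floor.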

V. Transient and Steady State Performance 

In this section, we provide the steady state analysis of the proposed algorithm based on the classical approach outlined in
[?] (analysis based on eigenvalues of autocorrelation matrices). We note that the proposed recursion for the $q$-th node can be
expressed in RKHS as follows:

$$\Omega_q(n) = \Omega_q(n-1) - \mu \sum_{\forall l} e_q(n)\, c_{lq}\, \phi(x_l) \qquad (15)$$

$$y_q(n) = \langle \Omega_q(n), \phi(x_{obs}) \rangle_{\mathcal{H}}$$

$$e_q(n) = d_q(n) - y_q(n)$$

$$e_q(n+1) = \sum_{\forall l} a_{lq}\, e_l(n)$$

$\Omega_q$ is an implicit parameter which is learned in RKHS and $x_{obs}$ is an input observation. Let the optimal value of the parameter
be $\Omega^o$ and the deviation of the implicit parameter from the optimal value at the $q$-th node at the $n$-th instant be denoted as $\tilde{\Omega}_q(n)$.
Subtracting $\Omega^o$ from both sides of the first equation of (15), we get:

$$\tilde{\Omega}_q(n) = \tilde{\Omega}_q(n-1) - \mu \sum_{\forall l} e_q(n)\, c_{lq}\, \phi(x_l) \qquad (16)$$
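Taking the inner product on both sides of the above equation with $\phi(x_{obs})$,

$$\tilde{y}_q(n) = \tilde{y}_q(n-1) - \mu \sum_{\forall l} c_{lq}\, (\tilde{y}_q(n-1) + n_q)\, \langle \phi(x_l), \phi(x_{obs}) \rangle_{\mathcal{H}} \qquad (17)$$

$$= \Big( 1 - \mu \sum_{\forall l} c_{lq}\, \langle \phi(x_l), \phi(x_{obs}) \rangle_{\mathcal{H}} \Big)\, \tilde{y}_q(n-1) - \mu \sum_{\forall l} c_{lq}\, n_q\, \langle \phi(x_l), \phi(x_{obs}) \rangle_{\mathcal{H}}$$

where $\tilde{y}_q(n) = \langle \tilde{\Omega}_q(n), \phi(x_{obs}) \rangle_{\mathcal{H}}$ and the error $e_q(n)$ has been replaced by $\tilde{y}_q(n-1) + n_q$, with $n_q$ the noise term at the $q$-th node.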

Please note that $y_q(n)$ is calculated after combination by the $A$ matrix in the last step of eq. (15). Define the matrices $A_1 =
A \otimes I_D$ and $C_1 = C \otimes I_D$, where $A$ and $C$ are the combining matrices and $I_D$ is a $D \times D$ identity matrix, where $D$ is
the cardinality of the network. Further, we define two vectors, namely $\Phi(x) = [\phi(x_1), \phi(x_2), \cdots, \phi(x_D)]^T$ and $\Phi(x_{obs}) =
[\phi(x_{obs}), \phi(x_{obs}), \cdots, \phi(x_{obs})]^T$. Using the above defined variables, we rewrite (17) as

$$\tilde{y}_q(n) = (1 - \mu \langle C_1 \Phi(x), \Phi(x_{obs}) \rangle_{\mathcal{H}})\, \tilde{y}_q(n-1) - \mu \langle C_1 \Phi(x), \Phi(x_{obs}) \rangle_{\mathcal{H}}\, n_q \qquad (18)$$

Squaring both sides, taking expectation, and considering terms only up to the first power of $\mu$, we get

$$E[|\tilde{y}_q(n)|^2] = [1 - 2\mu \langle C_1 \Phi(x), \Phi(x_{obs}) \rangle_{\mathcal{H}}]\, E[|\tilde{y}_q(n-1)|^2] + \mu^2 \sigma_n^2\, E(|\langle C_1 \Phi(x), \Phi(x_{obs}) \rangle_{\mathcal{H}}|^2) \qquad (19)$$

Based on (19), we derive the transient behaviour, steady-state behaviour and condition for convergence of the proposed algorithm.

A. Transient behaviour 

To estimate the speed of convergence of the proposed approach, it is essential to gain insight into the dynamical equation
that governs the evolution of the learning curve versus the number of iterations.

The dynamical equation (19) controls the transient behaviour at small step-sizes. The inner product $\langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$
depends on the choice of kernel. We use a real Gaussian kernel, as done in [?]:

$$\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big( -\frac{\|x - y\|^2}{2\sigma^2} \Big) \qquad (20)$$

where $x, y \in \mathbb{R}^d$ and $\phi : x \to \phi(x)$ is a feature map from the vector space of real numbers to the RKHS. Using the definition of
$\langle \cdot, \cdot \rangle_{\mathcal{H}}$ given in (20) in (19), we get the transient behaviour of the proposed approach. We see that, for a given $\mu$ and noise
variance $\sigma_n^2$, the transient behaviour of the proposed approach can be easily modeled using (19).
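The predicted learning curve can be generated by iterating the MSE recursion directly. The sketch below assumes the kernel inner product is a fixed positive constant, here called `kappa` (a shorthand introduced for illustration), so that its second moment is simply `kappa ** 2`; the numerical values are illustrative and not taken from the simulations.

```python
def mse_learning_curve(mu, kappa, noise_var, mse0, num_iter):
    """Iterate xi(n) = (1 - 2 mu kappa) xi(n-1) + mu^2 noise_var kappa^2."""
    curve = [mse0]
    for _ in range(num_iter):
        curve.append((1.0 - 2.0 * mu * kappa) * curve[-1]
                     + mu ** 2 * noise_var * kappa ** 2)
    return curve


# Illustrative values: step-size 0.12, noise variance 0.16, kappa = 1 (assumed).
curve = mse_learning_curve(mu=0.12, kappa=1.0, noise_var=0.16, mse0=1.0, num_iter=200)
floor = 0.12 * 0.16 * 1.0 / 2.0  # fixed point of the recursion for these values
```

Under this assumption the curve decays geometrically with factor `1 - 2 * mu * kappa` towards the fixed point `mu * noise_var * kappa / 2`.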

B. Steady state behaviour 

It is also essential to see how the MSE floor to which the proposed algorithm has converged varies with step-size. From
(19), assuming convergence ($E[|\tilde{y}_q(n)|^2] \approx E[|\tilde{y}_q(n-1)|^2]$), we arrive at the following expression for misadjustment:

$$\lim_{n \to \infty} E[|\tilde{y}_q(n)|^2] = \frac{\mu \sigma_n^2}{2}\, E(|\langle C_1 \Phi(x), \Phi(x_{obs}) \rangle_{\mathcal{H}}|) \qquad (21)$$

Thus we can see that the above equation (derived for non-linear systems) is similar to the equation derived in [?] for the
Widrow-Hoff learning rule for a single node for linear parameter estimation.
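The step from the MSE recursion to the floor can be made explicit. The following worked derivation is a sketch under the assumption that the kernel inner product is treated as a deterministic positive constant; the shorthands $\kappa$, $\xi(n)$ and $\xi_\infty$ are introduced here for illustration.

```latex
% Shorthand (assumed): \kappa = <C_1 \Phi(x), \Phi(x_obs)>_H > 0, treated as deterministic,
% \xi(n) = E[|\tilde{y}_q(n)|^2], \sigma_n^2 = noise variance.
\begin{align*}
\xi(n) &= (1 - 2\mu\kappa)\,\xi(n-1) + \mu^2 \sigma_n^2 \kappa^2 \\
\xi_\infty &= (1 - 2\mu\kappa)\,\xi_\infty + \mu^2 \sigma_n^2 \kappa^2
  && \text{(set } \xi(n) = \xi(n-1) = \xi_\infty \text{)} \\
2\mu\kappa\,\xi_\infty &= \mu^2 \sigma_n^2 \kappa^2 \\
\xi_\infty &= \frac{\mu\,\sigma_n^2\,\kappa}{2}
\end{align*}
```

Under this assumption the floor grows linearly with the step-size and with the noise variance, which mirrors the classical small-step-size result for LMS.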
C. Step-size range for convergence

For any adaptive algorithm, it is very important to set the step-size, $\mu$, in the range in which the algorithm converges. If
$\mu$ is too small, we may observe slow convergence. Too high a $\mu$ may result in mis-convergence.

The proposed algorithm converges iff the following condition holds:

$$|1 - 2\mu \langle C_1 \Phi(x), \Phi(x_{obs}) \rangle_{\mathcal{H}}| < 1 \qquad (22)$$

$$\Rightarrow 0 < \mu < \frac{1}{\langle C_1 \Phi(x), \Phi(x_{obs}) \rangle_{\mathcal{H}}}$$

Hence, if $\mu$ is in the above range, the proposed algorithm converges. This bound (derived for non-linear systems) is
similar to the general bound on the step-size for convergence of the Widrow-Hoff learning rule in a single-node linear
scenario.
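A quick numerical check of the bound: the sketch below treats the kernel inner product as a positive constant `kappa` (an assumed shorthand with an illustrative value) and compares the contraction factor of the MSE recursion just inside and just outside the admissible range.

```python
def step_size_bound(kappa):
    """Upper limit on mu from |1 - 2 mu kappa| < 1, for a positive kernel value kappa."""
    if kappa <= 0:
        raise ValueError("kappa must be positive")
    return 1.0 / kappa


def contraction_factor(mu, kappa):
    """Magnitude of the factor multiplying the previous MSE in the recursion."""
    return abs(1.0 - 2.0 * mu * kappa)


kappa = 2.0                                         # illustrative value
mu_max = step_size_bound(kappa)                     # upper limit on the step-size
inside = contraction_factor(0.9 * mu_max, kappa)    # below 1: the MSE contracts
outside = contraction_factor(1.1 * mu_max, kappa)   # above 1: the MSE grows
```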


VI. Results 

In this section, we present the simulation results based on the analysis presented in the previous sections. An independent and
identically distributed (i.i.d.) sequence $\{\pm 1\}$ was generated. This sequence was then passed through the non-linearity
$f(x) = x - 0.9x^2$, as in [?], so as to simulate a non-linear system. Further, additive white Gaussian noise of variance
0.16 was added. In other words, we considered a simple de-noising problem for our simulations. The convergence and error
performance of KLMS and diffusion-KLMS are shown in Fig. 1 for A = C = [0.5 0.5; 0.5 0.5] and in Fig. 2 for A =
[0.666 0.333; 0.333 0.666], C = [0.5 0.5; 0.5 0.5]. We see that although LMS and diffusion-LMS perform well in
linear channels, they fail to converge in non-linear channels. We observe superior convergence to lower MSE floors in the case
of diffusion-KLMS as compared to KLMS, LMS, diffusion-LMS and diffusion-RLS. We use $\mu = 0.2$ and spread parameter
$\sigma = 0.1$ for KLMS and the proposed KLMS-based approach. For LMS and diffusion-LMS, a step-size $\mu = 0.02$ is used for
simulation. We observe a performance gain of two decades for the proposed approach with respect to LMS and diffusion-LMS.
Also, we find a gain of a decade of performance with respect to single-node KLMS. We observe that the linear RLS exhibits
poor performance in a non-linear scenario, as the covariance matrix update fails due to the non-linearity.
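The data generation described above can be sketched as follows. The exponent of the non-linearity is not legible in the source, so the quadratic form used here is an assumption; the function names, the seed and the sample count are likewise illustrative.

```python
import random


def nonlinearity(x):
    """Memoryless non-linearity; the exponent 2 is an assumption."""
    return x - 0.9 * x ** 2


def generate_data(num_samples, noise_var=0.16, seed=0):
    """i.i.d. +/-1 symbols passed through the non-linearity, plus white Gaussian noise."""
    rng = random.Random(seed)
    symbols = [rng.choice((-1.0, 1.0)) for _ in range(num_samples)]
    noise_std = noise_var ** 0.5
    observed = [nonlinearity(s) + rng.gauss(0.0, noise_std) for s in symbols]
    return symbols, observed


symbols, observed = generate_data(1000)
```

Each node of the network would receive its own noisy copy of the observations in a multi-node experiment; that extension is left out here.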

Fig. 3 shows the steady-state behaviour of diffusion-KLMS as a function of step-size; the theoretical curves, which are
obtained from Section-V, are observed to be close to the experimental curves. The computational complexity of the training phase
of the proposed scheme is $O(D^2 |\mathcal{G}|)$ and the testing computational complexity is $O(D |\mathcal{G}|)$, as the computational complexities of
the training and testing phases are given as $O(D^2)$ and $O(D)$ respectively in [?], where $D$ is the dimensionality of the
observations.

From Fig. 4, we find that the proposed modeling of the transient behaviour of the MSE curve closely matches the experimental
transient behaviour for diffusion-KLMS. Please note that the dynamical modeling for the algorithm is more accurate in the
transient region of the plots. The transient region is generally specified by the time taken by the MSE plot to decay to exp(-1)
of its initial value [?], which is also called the time-constant of the adaptation. We see almost perfect modeling of the MSE plot
within the range of the time constant.
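The time-constant can be read off any learning curve with a few lines of code. The helper below is a sketch; the name `time_constant` and the geometric test curve are introduced here for illustration.

```python
import math


def time_constant(curve):
    """First iteration index at which the curve has decayed to exp(-1) of curve[0]."""
    threshold = math.exp(-1.0) * curve[0]
    for n, value in enumerate(curve):
        if value <= threshold:
            return n
    return None  # the curve never decays that far


# Illustrative geometric learning curve with decay factor 0.9 per iteration.
tau = time_constant([0.9 ** n for n in range(50)])
```

For this curve the result is 10, since 0.9^9 is about 0.387 (still above exp(-1), about 0.368) while 0.9^10 is about 0.349.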
To study how the MSE evolves as we remove or add a node in the network (in other words, change the network size),
we plot the experimental MSE floor as a function of network size in Fig. 5. We see that as the network size increases, the MSE
floor decreases, which is an intuitive result. Further, we compare the MSE floor obtained experimentally with the theoretical
expression for the same A, C matrices for the given network size. We average over 1000 iterations with various choices of A
and C, and plot the mean values of both the theoretical and experimental MSE floors as a function of the network size. We see
that the MSE floors predicted by the theoretical expressions derived in Section-V follow the experimentally obtained curves as
we increase the size of the network.


VII. Conclusion 

A new variant of the KLMS algorithm has been proposed, which is a distributed solution to the non-linear KLMS algorithm.
The proposed algorithm converges to a lower MSE floor than the original KLMS algorithm, as shown in this
paper. Theoretical expressions for both transient and steady-state performance have been derived, which closely match the
experimental values. Hence, the proposed diffusion-KLMS is a better adaptive algorithm for estimation than KLMS
in distributed non-linear systems. This work has potential applications in non-linear distributed inference over some targeted
components of 5G networks, such as detection over massive MIMO and cooperative spectrum sensing for cognitive radio.
Fig. 1. Convergence plot for LMS, Diffusion-LMS, Diffusion-RLS, KLMS and Diffusion-KLMS: A = C = [0.5 0.5; 0.5 0.5]

Fig. 2. Convergence plot for LMS, Diffusion-LMS, Diffusion-RLS, KLMS and Diffusion-KLMS comparison: A = [0.666 0.333; 0.333 0.666], C = [0.5 0.5; 0.5 0.5]

Fig. 3. MSE floors comparison for Diffusion-KLMS: theoretical and experimental, plotted against step-size; Setup-I: A = [0.5 0.5; 0.5 0.5], C = [0.5 0.5; 0.5 0.5], Setup-II: A = [0.666 0.333; 0.333 0.666], C = [0.5 0.5; 0.5 0.5]

Fig. 4. Transient behaviour at step-size 0.12 (panels: Setup-I and Setup-II), Setup-I: A = [0.5 0.5; 0.5 0.5], C = [0.5 0.5; 0.5 0.5], Setup-II: A = [0.666 0.333; 0.333 0.666], C = [0.5 0.5; 0.5 0.5]

Fig. 5. Variation of MSE floor with number of nodes (network cardinality) for SNR of 10 dB and 20 dB